Comparative analysis of gene ontology-based semantic similarity measurements for the application of identifying essential proteins

Identifying key proteins from protein-protein interaction (PPI) networks is one of the most fundamental and important tasks for computational biologists. However, the protein interactions obtained by high-throughput technology are characterized by a high false positive rate, which severely hinders the prediction accuracy of the current computational methods. In this paper, we propose a novel strategy to identify key proteins by constructing reliable PPI networks. Five Gene Ontology (GO)-based semantic similarity measurements (Jiang, Lin, Rel, Resnik, and Wang) are used to calculate the confidence scores for protein pairs under three annotation terms (Molecular function (MF), Biological process (BP), and Cellular component (CC)). The protein pairs with low similarity values are assumed to be low-confidence links, and the refined PPI networks are constructed by filtering the low-confidence links. Six topology-based centrality methods (the BC, DC, EC, NC, SC, and aveNC) are applied to test the performance of the measurements under the original network and refined network. We systematically compare the performance of the five semantic similarity metrics with the three GO annotation terms on four benchmark datasets, and the simulation results show that the performance of these centrality methods under refined PPI networks is relatively better than that under the original networks. Resnik with a BP annotation term performs best among all five metrics with the three annotation terms. These findings suggest the importance of semantic similarity metrics in measuring the reliability of the links between proteins and highlight the Resnik metric with the BP annotation term as a favourable choice.


Introduction
Proteins are crucial components of cell and tissue structures and are cornerstones used by an organism to maintain normal life activities. Because each protein plays a different role in the life activities of organisms, proteins are divided into essential proteins and nonessential proteins. The deletion or elimination of essential proteins may result in normal cellular function disorders or diseases and may even affect the development and survival of organisms [1,2]. Previous studies have shown that when a virus attacks the human body, it attacks essential proteins first [3]. For instance, when studying the novel coronavirus, the most important step is to determine several possible target proteins and then use large-scale computer-aided drug screening to find effective antiviral drugs. Therefore, identifying key proteins has vital application prospects in disease diagnosis [4], drug discovery [5], and drug design [6].
Traditional biological experiments can only be carried out in a limited number of species and are expensive and time consuming [7]. Fortunately, with the rapid development of high-throughput technology, a large amount of PPI data has been accumulated, which makes it convenient to identify essential proteins with computational methods.
PPI networks retrieved from high-throughput techniques are incomplete and inherently noisy [21]. The reliability of yeast two-hybrid assays is approximately 50%, even for the well-studied Saccharomyces cerevisiae species; this impairs the prediction performance of the available topology-based methods.
To overcome the influence of false positive data in PPI networks, two categories of methods have been developed to improve the performance of identifying essential proteins. The first category identifies essential proteins by combining the topological properties of PPI networks with various biological data, such as Gene Ontology (GO) annotation data [22][23][24][25][26][27], gene expression profiles [23,25,27,28,29,30,31], subcellular localizations [24,25,31], the domain features of proteins [32], orthologous information [30,33], and protein complex information [34,35]. Previous studies have demonstrated that the efficient and effective integration of multiple sources of data can yield better results for identifying essential proteins. For example, Kim [22] proposed that adding gene-level annotation information, such as GO terms, to detect essential proteins would result in higher accuracy than that of existing methods. Li et al. [29] introduced a novel essential protein prediction algorithm named CPPK, which predicts key proteins with a combination of network topology properties and gene expression data. Zhang et al. [23] developed a new method named TEO that combines PPI networks and gene expression profiles with GO annotation terms, and it achieved higher accuracy in predicting key proteins than previously developed methods. Peng et al. [32] developed a method called UDoNC that utilizes protein domain information and the topology of a given PPI network. Lei et al. [24] introduced a novel strategy named RSG using RNA-Seq, GO information, and subcellular localization. Zhang et al. [25] developed TEGS, a new strategy to predict key proteins, which improved prediction accuracy by integrating network topology with subcellular localization information, gene expression profiles, and GO annotation datasets. Peng et al. [33] developed a novel measure to predict key proteins by adding orthologous data. Zhang et al. [30] designed OGN by using gene expressions, orthologies, and network topologies to identify key proteins. Li et al. [34] introduced a novel idea that combines protein complex information with the topological properties of PPI networks.
The methods in the second category predict key proteins based on refined networks obtained by filtering the false positive interactions in the original network. For instance, Kim et al. [26] designed a motif-based method named MCGO, which utilizes Gene Ontology annotation data to prune several uninformative edges from the given network. Li et al. [31] proposed a novel approach to reconstruct PPI networks by using gene expression information and subcellular localization information. Liu et al. [36] developed a new algorithm, EPPSO, to identify key proteins according to improved particle swarm optimization and reconstructed PPI networks by combining the topology information of the PPI networks with other biological information. Lei et al. [27] presented RWEP, which utilizes GO terms and gene expression data to construct a new weighted PPI network and applies a random walk with restart algorithm to quantify the essentiality value of each protein. Simulation results show that RWEP dominates topology-based approaches in predicting key proteins. However, the performance of these approaches is still unsatisfactory, and many methods are complicated and involve many steps, which might hinder their wide application in biological research.
GO annotation is a system of uniform and normative descriptions of the genes and gene products of all species. A GO annotation collects information on the molecular function (MF), biological process (BP), and cellular component (CC) of different organisms. The GO-based semantic similarity metric (SSM) is a numerical measure that is used to estimate the semantic intimacy between two terms and is widely used for measuring the functional similarities between proteins [37][38][39]. Five widely used SSMs, Jiang [40], Lin [41], Rel [42], Resnik [43], and Wang [44], are applied to calculate the GO semantic similarity values at present. However, each of the SSMs focuses on characterizing particular aspects of GO annotation terms and has strengths as well as weaknesses. The advantages and disadvantages of these SSMs in evaluating GO semantic similarities are important for predicting key proteins.
In this paper, we comprehensively discuss the aforementioned five semantic similarity measurements in combination with three subontology (BP, CC, and MF) terms on the identification of essential proteins. Six centrality methods (the BC, DC, EC, NC, SC, and aveNC) are applied on refined GO-PPI networks and the results are compared with those of the same methods on the original PPI networks. Extensive comparisons have been conducted under different conditions, and the simulation results offer a reference to biologists when investigating the essential proteins of PPI networks.

Methods
In this part, six conventional centrality methods (the BC, DC, EC, NC, SC, and aveNC) are reviewed briefly. Then, refined PPI network construction methods are described in detail. Additionally, the utilized datasets and evaluation metrics are presented.

Centrality methods
PPI networks are abstracted into graph structures, which are denoted as PPI = (P, E), where P is the set of proteins and E represents the set of interactions between proteins. PPI networks are stored as adjacency matrices. The six centrality measures are defined as follows.

BC
$$BC(p) = \sum_{i \neq j \neq p} \frac{S_p(i, j)}{S(i, j)},$$
where $S_p(i, j)$ represents the number of shortest paths between proteins i and j that go through protein p and $S(i, j)$ represents the number of shortest paths between protein i and protein j. Because it considers the global characteristics of PPI networks, this method can identify nodes whose degrees are not high but that play a vital role in the connectivity of the given network.
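
The pair-counting definition of betweenness can be sketched with a breadth-first search from every protein. This is a minimal illustration rather than the authors' implementation: the `adj` dictionary format and the toy network are hypothetical, and the sketch assumes that each unordered pair (i, j) is counted once.

```python
from collections import deque

def betweenness(adj):
    """Brandes-style betweenness for an unweighted, undirected graph.

    adj: dict mapping each node to the set of its neighbours (assumed format).
    Returns a dict node -> sum over unordered pairs (i, j) of S_p(i, j) / S(i, j).
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies, farthest nodes first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair was visited from both of its endpoints.
    return {v: score / 2.0 for v, score in bc.items()}

# Toy path network A - B - C: only B lies between another pair.
toy = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
scores = betweenness(toy)
```

On the toy path, only the middle protein receives a nonzero score, which matches the intuition that BC rewards proteins that connect otherwise separated parts of the network.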

DC
$$DC(p) = \deg(p) = \sum_{u} a_{p,u},$$
where deg(p) represents the number of proteins connected to p directly, which is called the degree of p, and $a_{p,u} \in A$ represents the interaction between proteins p and u in the adjacency matrix A.
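
As a small illustration of the adjacency-matrix representation and the degree count, the sketch below builds a 0/1 matrix from an interaction list and sums each row. The protein identifiers and helper names are hypothetical, and self-interactions are ignored by assumption.

```python
def build_adjacency(proteins, interactions):
    """Return a symmetric 0/1 adjacency matrix as a list of lists."""
    index = {p: k for k, p in enumerate(proteins)}
    n = len(proteins)
    a = [[0] * n for _ in range(n)]
    for p, u in interactions:
        if p == u:
            continue  # assumption: self-interactions are ignored
        a[index[p]][index[u]] = 1
        a[index[u]][index[p]] = 1
    return a

def degree_centrality(adjacency):
    """DC(p) is the row sum of the adjacency matrix."""
    return [sum(row) for row in adjacency]

proteins = ["P1", "P2", "P3", "P4"]  # hypothetical identifiers
interactions = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
A = build_adjacency(proteins, interactions)
dc = dict(zip(proteins, degree_centrality(A)))
```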

EC
$$EC(p) = \alpha_{\max}(p),$$
where $\alpha_{\max}$ is the eigenvector of the adjacency matrix A belonging to the maximum eigenvalue $\lambda_{\max}$, and $\alpha_{\max}(p)$ is the pth component of this eigenvector.
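
One common way to obtain the principal eigenvector is power iteration. The sketch below is an assumption-laden illustration, not the procedure used in the paper: it iterates on A + I (which has the same eigenvectors as A) to avoid oscillation on bipartite networks, and it assumes a connected network.

```python
import math

def eigenvector_centrality(adjacency, iterations=200, tol=1e-10):
    """Approximate the principal eigenvector by power iteration on (A + I)."""
    n = len(adjacency)
    x = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # y = (A + I) x
        y = [x[i] + sum(adjacency[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        y = [v / norm for v in y]
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x

# Hypothetical star network: protein 0 interacts with proteins 1, 2 and 3.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
ec = eigenvector_centrality(star)
```

In the star example, the hub receives the largest component, and the three leaves share the same smaller value.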

NC
$$NC(p) = \sum_{u \in N_p} ECC(p, u) = \sum_{u \in N_p} \frac{|N_p \cap N_u|}{\min(|N_p| - 1, |N_u| - 1)},$$
where $N_p$ and $N_u$ represent the neighboring sets of proteins p and u, respectively, and ECC is the edge clustering coefficient. This method characterizes the connection relationships between a node and its neighbors; that is, the similarity of the relationship between two proteins is described by the number of their common neighbor nodes.
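
The neighbor-based score can be sketched as a sum of edge clustering coefficients over the interactions of a protein. The form of ECC used here (shared neighbors divided by the smaller degree minus one) is the commonly cited one and is an assumption about the paper's exact definition; the toy network is hypothetical.

```python
def neighbourhood_centrality(adj):
    """NC(p) = sum over neighbours u of ECC(p, u).

    ECC(p, u) = |N_p & N_u| / min(|N_p| - 1, |N_u| - 1); it is taken as 0
    when the denominator is 0 (assumption for degree-1 proteins).
    """
    nc = {}
    for p, neighbours in adj.items():
        total = 0.0
        for u in neighbours:
            shared = len(neighbours & adj[u])
            denom = min(len(neighbours) - 1, len(adj[u]) - 1)
            if denom > 0:
                total += shared / denom
        nc[p] = total
    return nc

# Hypothetical network: a triangle A-B-C with a pendant protein D attached to A.
toy = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
nc = neighbourhood_centrality(toy)
```

The pendant protein D scores 0 because its only interaction closes no triangle, while the three triangle members score equally.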

SC
$$SC(p) = \sum_{l=0}^{\infty} \frac{\mu_l(p)}{l!},$$
where $\mu_l(p)$ represents the number of loops of length l whose starting and ending protein is p. In complex networks, essential proteins tend to form dense subgraphs. The shorter the loop is, the more likely the protein is to be in a dense subgraph and to be a key protein.
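
Since the number of closed loops of length l at p equals the pth diagonal entry of the lth power of the adjacency matrix, the series can be approximated by truncation. The sketch below is only an illustration for small dense matrices; the cut-off `max_length` is a hypothetical parameter.

```python
def subgraph_centrality(adjacency, max_length=30):
    """Approximate SC(p) = sum_l mu_l(p) / l! by truncating at max_length.

    mu_l(p) is the p-th diagonal entry of A**l, i.e. the number of closed
    walks of length l that start and end at p.
    """
    n = len(adjacency)
    # `power` holds A**l / l!, starting from the identity matrix (l = 0).
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    sc = [power[i][i] for i in range(n)]
    for l in range(1, max_length + 1):
        power = [
            [sum(power[i][k] * adjacency[k][j] for k in range(n)) / l
             for j in range(n)]
            for i in range(n)
        ]
        for i in range(n):
            sc[i] += power[i][i]
    return sc

# Hypothetical two-protein network with a single interaction.
pair = [[0, 1], [1, 0]]
sc = subgraph_centrality(pair)
```

For the single-interaction example only even-length loops exist, so the value converges to cosh(1), roughly 1.543. The factorial weighting makes long loops contribute very little, which is why a modest truncation length is usually sufficient.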

aveNC
where N p represents the set of protein p's neighbors. The significance of a protein is measured by its neighbors.

Constructing refined PPI network by applying GO-based SSMs
There are two kinds of measures used to record confidence scores for a PPI network. One relies on interaction data [45], and the other takes gene expression values [46], functional similarities [39,47], and other information into consideration [37]. According to the basic idea that proteins interacting in the same cell have a higher possibility of being involved in a similar biological process than proteins that do not interact, we assume that the protein pairs with smaller semantic similarity values are more likely to be false positive links. Five widely used methods, Jiang, Lin, Rel, Resnik, and Wang, are applied to compute semantic similarities based on the GO terms between proteins, and these are denoted as confidence scores. Wang determines the confidence scores between two proteins according to the locations of their corresponding GO terms in the GO graph and their ancestor terms' relationships. The other four methods are based on information content (IC), which depends on the probabilities of the two GO terms involved and their closest common ancestor terms in the corpus of the GO annotation information.
The details of the five semantic similarity metrics (SSMs) are as follows.

Resnik
Resnik defines the similarity between two terms as the information content (IC) of their most informative common ancestor (MICA) [48]. The similarity between protein pairs m and v in this method is denoted as
$$sim_{Resnik}(m, v) = \max_{t \in C} IC(t) = IC(MICA),$$
where C represents the set of common ancestors of m and v. The IC mentioned above is denoted as
$$IC(t) = -\log p(t),$$
where p(t) represents the probability of occurrence of term t in the GO corpus, and IC is used to express the specificity of a term.
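
To make the MICA idea concrete, the following sketch computes the information content and the Resnik similarity on a toy ontology. The term names, the parent links, and the probabilities are all hypothetical; in practice p(t) would be estimated from the GO annotation corpus.

```python
import math

# Hypothetical toy ontology: each term maps to its parent terms.
PARENTS = {
    "root": [],
    "t1": ["root"],
    "t2": ["root"],
    "t3": ["t1"],
    "t4": ["t1"],
}
# Hypothetical probabilities p(t): fraction of annotations at t or below it.
PROB = {"root": 1.0, "t1": 0.5, "t2": 0.5, "t3": 0.25, "t4": 0.25}

def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def ic(term):
    """Information content IC(t) = -log p(t)."""
    return -math.log(PROB[term])

def resnik(m, v):
    """IC of the most informative common ancestor of m and v."""
    common = ancestors(m) & ancestors(v)
    return max(ic(t) for t in common)

sim_siblings = resnik("t3", "t4")  # the MICA is t1
sim_distant = resnik("t3", "t2")   # the MICA is the root
```

Two terms whose only shared ancestor is the root receive a similarity of 0, because the root carries no information.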

Lin and Jiang
The Resnik method is effective for calculating the similarity of two terms, but it cannot distinguish between term pairs that have the same MICA. To tackle this problem, Lin and Jiang developed new methods that jointly consider the ICs of the two terms and the IC of their MICA. The similarity of two proteins based on the Lin and Jiang methods is defined as
$$sim_{Lin}(m, v) = \frac{2 \times IC(MICA)}{IC(m) + IC(v)},$$
$$sim_{Jiang}(m, v) = 1 - \min\bigl(1,\; IC(m) + IC(v) - 2 \times IC(MICA)\bigr).$$

Rel
Shortcomings still exist in the approaches developed by Lin and Jiang. The similarity between two terms is overestimated when one term is an ancestor of the other. In addition, these approaches ignore the specificities of the two terms. By combining Resnik and Lin, Rel presented a novel measure to capture the similarity between two terms. The similarity between two proteins is defined as
$$sim_{Rel}(m, v) = \frac{2 \times IC(MICA)}{IC(m) + IC(v)} \times \bigl(1 - p(MICA)\bigr).$$
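
The three IC-based variants differ only in how they combine the ICs of the two terms with that of their MICA. The sketch below uses commonly cited forms of these measures, which is an assumption: the Jiang similarity in particular has several published variants, and the source does not state which one was used. The probabilities are hypothetical.

```python
import math

def lin(ic_m, ic_v, ic_mica):
    """Lin: IC of the MICA relative to the ICs of the two terms."""
    total = ic_m + ic_v
    return 2.0 * ic_mica / total if total > 0 else 0.0

def jiang(ic_m, ic_v, ic_mica):
    """Jiang: one common similarity form of the Jiang-Conrath distance (assumed)."""
    return 1.0 - min(1.0, ic_m + ic_v - 2.0 * ic_mica)

def rel(ic_m, ic_v, ic_mica, p_mica):
    """Rel: Lin similarity down-weighted when the MICA is a frequent term."""
    return lin(ic_m, ic_v, ic_mica) * (1.0 - p_mica)

# Hypothetical probabilities for two terms and their MICA.
p_m, p_v, p_mica = 0.25, 0.25, 0.5
ic_m, ic_v, ic_mica = (-math.log(p) for p in (p_m, p_v, p_mica))

sim_lin = lin(ic_m, ic_v, ic_mica)
sim_jiang = jiang(ic_m, ic_v, ic_mica)
sim_rel = rel(ic_m, ic_v, ic_mica, p_mica)
```

With these inputs the Rel score is half the Lin score, because the shared ancestor covers half of the corpus and is therefore only weakly informative.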

Wang
Wang is a hybrid method that combines the number of common ancestors with the locations of these ancestors in the GO graph when calculating the similarity between two terms. GO terms are presented as directed acyclic graphs (DAGs). Suppose that the DAG of term v is
$$G_v = (v, P_v, E_v),$$
where $P_v$ contains the ancestor terms of v including v itself, and $E_v$ is composed of the edges that connect the GO terms in $G_v$. Terms closer to v in $G_v$ contribute more to its semantics. The contribution of a term u to the semantics of term v in $G_v$ is defined as the S-value of u and is calculated as
$$S_v(u) = \begin{cases} 1, & u = v, \\ \max\{\omega_e \times S_v(u') \mid u' \in \mathrm{children}(u)\}, & u \neq v, \end{cases}$$
where $\omega_e$ ($0 < \omega_e < 1$) is the semantic contribution factor for edge $e \in E_v$ that links term u with its child term $u'$. SV(v) is used to compare the semantics of two GO terms and is defined as
$$SV(v) = \sum_{u \in P_v} S_v(u).$$
The semantic similarity between protein pairs m and v is denoted as
$$sim_{Wang}(m, v) = \frac{\sum_{t \in P_m \cap P_v} \bigl(S_m(t) + S_v(t)\bigr)}{SV(m) + SV(v)}.$$

In this article, we apply five GO-based semantic similarity measurements to measure the reliability of protein pairs. For each SSM, we first compute the confidence scores for all of the protein pairs and then construct refined PPI networks by filtering the interactions with low confidence scores. The refined PPI networks obtained by measuring the GO semantic similarity are named GO-PPI for short, and the network refined by using the Resnik metric under the BP annotation term is named Resnik-BP GO-PPI for short. The main idea of constructing a refined GO-PPI network is shown in Fig 1.
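
The S-value propagation used by the Wang method in this section can be sketched as a walk from a term up to the root, keeping the largest weighted contribution for each ancestor. The toy ontology and its term names are hypothetical, and the edge weight of 0.8 is an assumed value for illustration only.

```python
# Hypothetical toy ontology: each term maps to {parent: edge weight w_e}.
PARENTS = {
    "root": {},
    "t1": {"root": 0.8},
    "t2": {"t1": 0.8},
    "t3": {"t1": 0.8},
}

def s_values(term):
    """S-values of `term` and all of its ancestors within the DAG of `term`."""
    s = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, weight in PARENTS[child].items():
            candidate = weight * s[child]
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                stack.append(parent)  # re-propagate the improved value
    return s

def wang(m, v):
    """Shared S-values divided by the total semantic values SV(m) + SV(v)."""
    s_m, s_v = s_values(m), s_values(v)
    shared = set(s_m) & set(s_v)
    numerator = sum(s_m[t] + s_v[t] for t in shared)
    return numerator / (sum(s_m.values()) + sum(s_v.values()))

sim = wang("t2", "t3")
self_sim = wang("t2", "t2")
```

Two sibling terms share the S-values of their parent and of the root, which gives a similarity of 2.88 / 4.88 (about 0.59) here, whereas a term compared with itself scores 1.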

Experimental data
To compare the performance of these centrality methods under different combinations of strategies, we choose the well-studied Saccharomyces cerevisiae PPI data for experiments, as they are widely applied for testing the performance of new methods. The datasets include the YDIP dataset composed of 5093 proteins and 24743 interactions, the new DIP dataset, which includes 4928 proteins and 17201 interactions, the Krogan dataset containing 7123 interactions among 2708 proteins, and the Krogan Extended dataset, which consists of 3672 proteins with 14317 interactions. A summary of these datasets is presented in Table 1.
The GO annotation information of each protein is downloaded from the Saccharomyces Genome Database, which was released on September 10th, 2020.

Evaluation metrics
To measure the efficiency of the proposed strategy, we calculate the numbers of key proteins predicted correctly among the top 600 ranked proteins, and the corresponding prediction precisions of the six topology-based methods are also calculated under the original PPI network and refined GO-PPI network. The prediction precision is denoted as
$$Precision = \frac{TP}{TP + FP},$$
where TP describes the number of true positives, and FP describes the number of false positives.
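
The top-k evaluation can be sketched as ranking proteins by their centrality score and counting how many of the first k are known to be essential. The function name, the scores, and the essential set below are hypothetical; ties are broken by input order, which is an assumption.

```python
def precision_at_k(scores, essential, k=600):
    """Return (TP, precision) for the k highest-scoring proteins."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    tp = sum(1 for protein in ranked if protein in essential)
    fp = len(ranked) - tp
    precision = tp / (tp + fp) if ranked else 0.0
    return tp, precision

# Hypothetical centrality scores and gold-standard essential proteins.
scores = {"P1": 5.0, "P2": 4.0, "P3": 3.0, "P4": 1.0}
essential = {"P1", "P3", "P4"}
tp, precision = precision_at_k(scores, essential, k=3)
```

Here two of the three top-ranked proteins are essential, so the precision is 2/3.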

Results and discussion
To evaluate whether the performance of the reconstructed GO-PPI network is better than that of the corresponding original PPI network in identifying key proteins, six topology structure-based methods (the BC, DC, EC, NC, SC, and aveNC) are applied in the experiments. We compare the numbers of key proteins identified properly and the prediction precisions under different types of strategies. The threshold for GO semantic similarity is set to 0.33 for filtering the unreliable links in the PPI networks.
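
The filtering step can be sketched as a single pass over the interaction list that keeps only the pairs whose confidence score reaches the threshold. The data below are hypothetical, and two details are assumptions because the source does not state them: pairs without a score are treated as 0, and scores equal to the threshold are kept.

```python
def refine_network(interactions, confidence, threshold=0.33):
    """Keep the interactions whose confidence score reaches the threshold."""
    kept = []
    for p, u in interactions:
        # Assumption: a pair with no GO-based score gets confidence 0.
        score = confidence.get(frozenset((p, u)), 0.0)
        if score >= threshold:
            kept.append((p, u))
    return kept

# Hypothetical interactions and semantic similarity scores.
interactions = [("P1", "P2"), ("P1", "P3"), ("P2", "P4")]
confidence = {
    frozenset(("P1", "P2")): 0.71,
    frozenset(("P1", "P3")): 0.12,
}
refined = refine_network(interactions, confidence)
proteins_left = sorted({p for edge in refined for p in edge})
```

Proteins that lose all of their interactions drop out of the refined network, which is consistent with the reduced protein counts reported for the GO-PPI networks.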

Analysis of the original network and refined GO-PPI network
The number of interactions in a network influences the speed of calculation for identifying essential proteins. The lower the number of interactions, the less time is required for the calculation. Therefore, we compute the number of interactions and the proportions of key proteins under the original PPI network and the refined GO-PPI network for the YDIP dataset. As shown in Table 2, the number of interactions declines dramatically after filtering the links with low confidence scores, and more than half of the interactions are filtered, so the computational efficiency is greatly improved. Furthermore, the numbers of proteins and key proteins are reduced, but the proportion of essential proteins is increased, which is more beneficial for identifying key proteins. For example, in networks with the application of the Resnik metric, the proportions of essential proteins under the three subontologies (the BP, CC, and MF) reach 39.83%, 41.55%, and 37.82%, respectively, while the proportion is 22.91% in the original PPI network.
In the meantime, we study the interactions that rank among the top 600. As the numbers of interactions are different for the original network and the reconstructed GO-PPI network, we compute the proportions of the interactions between essential protein pairs (Ess-ess), essential and nonessential protein pairs (Ess-noness), and nonessential pairs (Noness-noness), and the results for the YDIP dataset under the BP subontology are shown in Table 3. It can be seen that the portion of Ess-ess interactions is significantly improved under the five refined GO-PPI networks, and the portions of Ess-noness and Noness-noness interactions under the GO-PPI network are much lower than those under the original PPI network. We can also see that Wang achieves the best performance compared with those of the other four SSMs. For instance, the portion of essential pairs reaches 57.99% when using the NC method under Wang, which is the highest for the six different networks. And the interactions between the essential and nonessential pairs are only 12.87% of total interactions under Wang versus 37.27% under the original PPI network for the SC method.

Comparison of the numbers of true predictions under different strategies
In this part, we systematically evaluate the performance of the newly constructed networks on the four test datasets. For each dataset, we adopt five SSMs to calculate confidence scores for the protein pairs in the PPI network under the three GO annotation terms (the BP, MF, and CC) and obtain fifteen kinds of refined GO-PPI networks. Six centrality methods are applied to predict the key proteins of the newly constructed GO-PPI network and the original PPI network. Table 4 presents the numbers of essential proteins correctly identified from the top 600 candidate proteins of the original network and refined GO-PPI network with different SSMs under the three subontologies (the BP, CC, and MF). As seen from Table 4, for the YDIP dataset, the numbers of essential proteins correctly identified under the six centrality methods on each of the newly constructed GO-PPI networks are consistently larger than those under the corresponding original PPI networks. For example, compared to the original network, the EC method yields an improvement of 57.01% on the Wang-BP (the Wang method under the BP subontology) network, and the aveNC method provides an improvement of 300% on the Resnik-BP network. In terms of the three subontologies, the performance of these methods under the refined GO-PPI network obtained with the BP annotation term is significantly better than that under the CC and MF annotation terms, especially for the Resnik and Wang methods.
To verify the superiority of the newly proposed strategy, we calculate the number of key proteins identified correctly by each method under three GO subontology terms for the reduced DIP PPI dataset, the Krogan dataset, and the Krogan Extended dataset. The calculation results are listed in Tables 5-7.
For the new DIP PPI dataset, the comparison results are shown in Table 5. We can observe that the six centrality methods perform best under the refined GO-PPI network constructed by using the Resnik metric with the BP subontology, suggesting that this network is relatively more accurate and complete than the networks constructed under the MF and CC subontologies. However, for the MF and CC subontologies, some of the centrality methods perform poorly under the refined GO-PPI network, such as the BC method under the Jiang-CC (the Jiang method under the CC subontology) PPI network and the DC method under the Resnik-MF (the Resnik method under the MF subontology) PPI network. The maximum number of essential proteins predicted by the NC method in all five newly constructed PPI networks under the MF subontology is 306, compared with 318 correctly predicted essential proteins under the original PPI network. Considering the number of interactions under the refined GO-PPI network in the MF subontology (Table 2), this might be because the GO annotation under MF is incomplete for the protein pairs in the new DIP dataset; therefore, the confidence scores of many truly interacting protein pairs are set to 0, and the refined network constructed by using the five SSMs is relatively sparse, which hinders the performance of the NC centrality approach in identifying key proteins.
As seen from the results obtained using the Krogan dataset in Table 6, the six centrality methods predict more true key proteins under the refined GO-PPI networks constructed by using the five SSMs with the BP and CC annotation terms than under the original networks. In particular, under the GO-PPI network filtered by the Wang method under the BP term, the numbers of correctly identified proteins achieved by two centrality methods (the DC and NC) reach 336, which is significantly larger than those on the original PPI network. For the CC annotation term, the network filtered by using the Resnik metric is relatively more precise than the networks filtered by the other metrics in predicting key proteins. Compared to the number of correct predictions obtained under the original PPI network, more than half of the centrality methods perform better under the newly constructed network with the MF annotation term, the exceptions being the DC and NC methods.
Similar results are obtained on the Krogan Extended dataset and listed in Table 7. The number of key proteins correctly predicted under the newly refined GO-PPI networks constructed with the BP subontology is consistently larger than that under the original PPI networks, and these refined networks outperform the networks constructed with the CC and MF subontologies.
To further investigate the performance of the six centrality methods under the newly refined networks, we take the network constructed by using the Resnik metric with BP subontology for the YDIP dataset as an example. We calculate the numbers of key proteins predicted by these centrality approaches among the top 100, 200, 300, 400, 500, and 600 ranked candidates. As shown in Fig 2, the performance of these six topology-based methods is highly improved under the reconstructed GO-PPI network in terms of the number of key proteins identified correctly. Particularly, for the SC method, 91 out of 100 candidate predicted proteins are correctly identified, which is significantly more than those predicted by all of the other state-of-the-art approaches. When compared to the results of the original PPI network, 85.22%

Comparison of prediction precision for the six centrality methods
To validate the advantage of the reconstructed GO-PPI network in predicting key proteins intuitively, six centrality approaches (the BC, DC, EC, NC, SC, and aveNC) are taken to predict key proteins under the original PPI network and reconstructed GO-PPI network. Fig 3 shows the prediction precision comparison for the six centrality approaches under the original PPI network and GO-PPI network reconstructed by using the Wang method with the BP subontology information for the YDIP dataset. Fig 3 shows that the prediction precisions of these six methods under the newly constructed GO-PPI network show significant improvements over those obtained with the original PPI network.

Comparison of ROC curves
To further demonstrate the performance of the proposed strategy, we compare the ROC curves of the different methods under the original PPI networks and the corresponding GO-PPI networks. The top 600 ranked proteins predicted by each method are assumed to be essential, and the remaining proteins are assumed to be nonessential. The gold-standard essential proteins for a GO-PPI network are obtained from the original set of true essential proteins by removing the proteins that are not in the GO-PPI network. The rank values of the proteins in the original PPI network and the GO-PPI network are normalized, and the true positive rate and false positive rate are calculated as the threshold value varies in [0, 1]. We draw the ROC curve from the obtained true positive rates and false positive rates. The AUC is the area under the ROC curve and is calculated by using the trapz function in Matlab. The ROC curves and AUC values under the new DIP and YDIP PPI networks are shown in Figs 4 and 5. As shown in Figs 4 and 5, the ROC curves under the GO-PPI networks lie above those under the corresponding original PPI networks, suggesting that the GO-PPI networks we constructed are reliable for predicting essential proteins.
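
The threshold sweep and the trapezoidal area can be sketched as follows. This is a generic ROC construction under stated assumptions, not the authors' Matlab code: scores are min-max normalized, each distinct normalized score is used as a threshold, and the data are hypothetical. The sketch also assumes that at least one essential and one nonessential protein are present.

```python
def roc_auc(scores, essential):
    """Return (fpr, tpr, auc) by sweeping a threshold over normalised scores."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    norm = {p: (s - lo) / span for p, s in scores.items()}
    positives = sum(1 for p in norm if p in essential)
    negatives = len(norm) - positives
    # Start at (0, 0): a threshold above every score predicts nothing.
    fpr, tpr = [0.0], [0.0]
    for t in sorted(set(norm.values()), reverse=True):
        predicted = [p for p, s in norm.items() if s >= t]
        tp = sum(1 for p in predicted if p in essential)
        fp = len(predicted) - tp
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    # Trapezoidal rule, the same idea as a trapz-style integration.
    auc = sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
        for i in range(1, len(fpr))
    )
    return fpr, tpr, auc

# Hypothetical centrality scores and gold-standard essential proteins.
scores = {"P1": 0.9, "P2": 0.8, "P3": 0.4, "P4": 0.1}
essential = {"P1", "P3"}
fpr, tpr, auc = roc_auc(scores, essential)
```

In this example three of the four essential/nonessential pairs are ranked in the correct order, and the computed AUC is 0.75 accordingly.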

Analysis of the effect of the threshold
Since the new GO-PPI network is constructed by filtering the unreliable links in the original PPI network, we need to choose an appropriate threshold to distinguish false positive data and real interactions. However, the threshold value is related to the SSMs and the quality of a given PPI network, and different thresholds should be set for different SSMs to achieve the best performance.

PLOS ONE
To investigate the effect of the threshold on the performance of the methods in identifying essential proteins, we plot the true numbers of key proteins identified among the top 100, 200, 300, 400, 500, and 600 candidates as functions of the threshold value for the YDIP Jiang-BP network in Fig 6. As shown in Fig 6, the numbers of correct predictions increase with increasing threshold value for all of the methods, especially the DC, SC, and aveNC methods. The results show that GO semantic similarity is efficient in filtering unreliable links in PPI networks, and almost all of the considered methods achieve the maximum number of correctly predicted essential proteins with a relatively large threshold value.

Conclusions
Predicting essential proteins by developing computational methods from PPI networks has been a hot topic in recent years. However, the PPIs obtained by high-throughput technology at present have high false positive rates. False interactions in PPI networks have great effects on the performance of computational methods in terms of predicting key proteins. Semantic similarity measures have been shown to be useful for assessing the confidence scores between linked protein pairs. However, it remains unclear which of the five widely used semantic similarity measurements is the most appropriate for measuring the reliability of interactions.
This paper presents a comparison between GO-PPI networks newly constructed by five semantic similarity methods with three GO annotation terms and the corresponding original PPI networks. The six topology-based centrality methods (the BC, DC, EC, NC, SC, and aveNC) are used to calculate the numbers of correct predictions and the precisions for the 600 top-ranked candidate proteins under the newly constructed GO-PPI networks and the original networks. The comparison results suggest that the prediction accuracies under each of the newly constructed GO-PPI networks are consistently higher than those under the original PPI network. In particular, the networks constructed by using the semantic similarity metrics of Resnik and Wang under the BP annotation term are the most reliable for predicting essential proteins with these topology-based centrality methods. These results suggest that constructing a new PPI network by using the Resnik and Wang metrics under the BP annotation term can filter out some false positive data effectively and improve the quality of the network, which is also the direction of future research.